Instruction Scheduling

In computer science, instruction scheduling is a compiler optimization used to improve instruction-level parallelism, which improves performance on machines with instruction pipelines. Put more simply, it tries to do the following without changing the meaning of the code:

* Avoid pipeline stalls by rearranging the order of instructions.
* Avoid illegal or semantically ambiguous operations (typically involving subtle instruction pipeline timing issues or non-interlocked resources).

Pipeline stalls can be caused by structural hazards (processor resource limits), data hazards (the output of one instruction is needed by another instruction) and control hazards (branching).


Data hazards

Instruction scheduling is typically done on a single basic block. In order to determine whether rearranging the block's instructions in a certain way preserves the behavior of that block, we need the concept of a ''data dependency''. There are three types of dependencies, which also happen to be the three data hazards:

# Read after Write (RAW or "True"): Instruction 1 writes a value used later by Instruction 2. Instruction 1 must come first, or Instruction 2 will read the old value instead of the new.
# Write after Read (WAR or "Anti"): Instruction 1 reads a location that is later overwritten by Instruction 2. Instruction 1 must come first, or it will read the new value instead of the old.
# Write after Write (WAW or "Output"): Two instructions both write the same location. They must occur in their original order.

Technically, there is a fourth type, Read after Read (RAR or "Input"): both instructions read the same location. Input dependence does not constrain the execution order of two statements, but it is useful in scalar replacement of array elements.

To make sure we respect the three types of dependencies, we construct a dependency graph, which is a directed graph where each vertex is an instruction and there is an edge from I1 to I2 if I1 must come before I2 due to a dependency. If loop-carried dependencies are left out, the dependency graph is a directed acyclic graph. Then, any topological sort of this graph is a valid instruction schedule. The edges of the graph are usually labelled with the ''latency'' of the dependence: the number of clock cycles that must elapse before the pipeline can proceed with the target instruction without stalling.
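
To make this concrete, here is a minimal Python sketch of dependency-graph construction under simplifying assumptions: instructions are modelled only by the locations they read and write, memory dependencies between stores are ignored, and every edge gets a uniform latency. The Instr type and the example block are invented for illustration, not taken from any particular compiler.

    from collections import namedtuple

    # Hypothetical instruction model: a display name, the locations it
    # reads, and the location it writes (None for stores/branches).
    Instr = namedtuple("Instr", ["name", "reads", "writes"])

    def build_dependency_graph(block, latency=1):
        """Return edges (i, j, latency) meaning block[i] must precede block[j]."""
        edges = []
        for j, b in enumerate(block):
            for i, a in enumerate(block[:j]):
                raw = a.writes is not None and a.writes in b.reads   # true
                war = b.writes is not None and b.writes in a.reads   # anti
                waw = a.writes is not None and a.writes == b.writes  # output
                if raw or war or waw:
                    edges.append((i, j, latency))
        return edges

    block = [
        Instr("r1 = load a",  reads={"a"},        writes="r1"),
        Instr("r2 = load b",  reads={"b"},        writes="r2"),
        Instr("r3 = r1 + r2", reads={"r1", "r2"}, writes="r3"),
        Instr("store r3",     reads={"r3"},       writes=None),
    ]
    print(build_dependency_graph(block))
    # [(0, 2, 1), (1, 2, 1), (2, 3, 1)] -- the two loads are independent.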


Algorithms

The simplest algorithm to find a topological sort is frequently used and is known as list scheduling. Conceptually, it repeatedly selects a source of the dependency graph, appends it to the current instruction schedule and removes it from the graph. This may cause other vertices to become sources, which will then also be considered for scheduling. The algorithm terminates when the graph is empty. (A code sketch of this loop follows the list of heuristics below.)

To arrive at a good schedule, stalls should be prevented. This is determined by the choice of the next instruction to be scheduled. A number of heuristics are in common use:

* The processor resources used by the already scheduled instructions are recorded. If a candidate uses a resource that is occupied, its priority will drop.
* If a candidate is scheduled closer to its predecessors than the associated latency allows, its priority will drop.
* If a candidate lies on the critical path of the graph, its priority will rise. This heuristic provides some form of look-ahead in an otherwise local decision process.
* If choosing a candidate will create many new sources, its priority will rise. This heuristic tends to generate more freedom for the scheduler.
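
Here is a minimal Python sketch of that loop, continuing the edge-list representation from the previous section and using only the critical-path heuristic to rank candidates; the other heuristics and any model of processor resources are omitted for brevity.

    def list_schedule(n, edges):
        """Greedily schedule n instructions given (i, j, latency) edges
        with i < j; among ready instructions, the one with the longest
        latency path to the end of the block is chosen first."""
        succs = {i: [] for i in range(n)}
        preds_left = [0] * n
        for i, j, lat in edges:
            succs[i].append((j, lat))
            preds_left[j] += 1

        # Priority of a node = longest latency path starting at it.
        # Every edge satisfies i < j, so a reverse index sweep visits
        # the nodes in reverse topological order.
        prio = [0] * n
        for i in reversed(range(n)):
            for j, lat in succs[i]:
                prio[i] = max(prio[i], lat + prio[j])

        ready = [i for i in range(n) if preds_left[i] == 0]
        schedule = []
        while ready:
            ready.sort(key=lambda i: -prio[i])  # highest priority first
            i = ready.pop(0)
            schedule.append(i)
            for j, _ in succs[i]:               # new sources become ready
                preds_left[j] -= 1
                if preds_left[j] == 0:
                    ready.append(j)
        return schedule

For the example block above, list_schedule(4, build_dependency_graph(block)) yields [0, 1, 2, 3]: both loads are issued before the add, so their latencies can overlap.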


Phase order

Instruction scheduling may be done either before or after register allocation, or both before and after it. The advantage of doing it before register allocation is that this results in maximum parallelism. The disadvantage is that the register allocator may then need more registers than are available, forcing it to introduce spill/fill code, which reduces the performance of the section of code in question. If the architecture being scheduled has instruction sequences with potentially illegal combinations (due to a lack of instruction interlocks), the instructions must be scheduled after register allocation. This second scheduling pass will also improve the placement of the spill/fill code. If scheduling is only done after register allocation, the register allocation itself will have introduced false dependencies that limit the amount of instruction motion possible for the scheduler.
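
These false dependencies can be seen with the dependency-graph sketch from the section above (register names hypothetical): before allocation the two loads use distinct virtual registers and are independent, but mapping both to the same physical register introduces WAW and WAR edges that pin the second load behind the first store.

    virtual = [
        Instr("v1 = load a", reads={"a"},  writes="v1"),
        Instr("store v1",    reads={"v1"}, writes=None),
        Instr("v2 = load b", reads={"b"},  writes="v2"),
        Instr("store v2",    reads={"v2"}, writes=None),
    ]
    print(build_dependency_graph(virtual))
    # [(0, 1, 1), (2, 3, 1)] -- the two load/store pairs can interleave.

    physical = [
        Instr("r1 = load a", reads={"a"},  writes="r1"),
        Instr("store r1",    reads={"r1"}, writes=None),
        Instr("r1 = load b", reads={"b"},  writes="r1"),
        Instr("store r1",    reads={"r1"}, writes=None),
    ]
    print(build_dependency_graph(physical))
    # Extra edges appear, e.g. WAW (0, 2) and WAR (1, 2): the order is fixed.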


Types

There are several types of instruction scheduling:

# ''Local'' (''basic block'') ''scheduling'': instructions can't move across basic block boundaries.
# ''Global scheduling'': instructions can move across basic block boundaries.
# ''Modulo scheduling'': an algorithm for generating software pipelining, which is a way of increasing instruction level parallelism by interleaving different iterations of an inner loop (see the sketch after this list).
# ''Trace scheduling'': the first practical approach for global scheduling, trace scheduling tries to optimize the control flow path that is executed most often.
# ''Superblock scheduling'': a simplified form of trace scheduling which does not attempt to merge control flow paths at trace "side entrances". Instead, code can be implemented by more than one schedule, vastly simplifying the code generator.
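
As a rough illustration of software pipelining, the following Python sketch overlaps stage 2 of one iteration with stage 1 of the next. The split of the loop body into a "load" stage and a "compute" stage, and the helper f, are assumptions for the example; a real modulo scheduler performs this transformation at the machine-instruction level.

    def f(x):                # hypothetical per-element computation
        return 2 * x

    def plain_loop(b):
        out = []
        for i in range(len(b)):
            x = b[i]             # stage 1: load
            out.append(f(x))     # stage 2: compute and store
        return out

    def pipelined_loop(b):       # assumes len(b) >= 1
        out = []
        x = b[0]                 # prologue: stage 1 of iteration 0
        for i in range(1, len(b)):
            out.append(f(x))     # stage 2 of iteration i-1 ...
            x = b[i]             # ... overlapped with stage 1 of iteration i
        out.append(f(x))         # epilogue: stage 2 of the last iteration
        return out

    assert plain_loop([1, 2, 3]) == pipelined_loop([1, 2, 3])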


Compiler examples

The GNU Compiler Collection is one compiler known to perform instruction scheduling, using the -march (both instruction set and scheduling) or -mtune (only scheduling) flags. It uses descriptions of instruction latencies and of which instructions can run in parallel (or equivalently, which "port" each uses) for each microarchitecture to perform the task. This feature is available for almost all architectures that GCC supports.

Until version 12.0.0, the instruction scheduling in LLVM/Clang could only accept a -march (called target-cpu in LLVM parlance) switch for both instruction set and scheduling. Version 12 adds support for -mtune (tune-cpu) for x86 only.

Sources of information on latency and port usage include:

* GCC and LLVM;
* Agner Fog, who compiles extensive data for the x86 architecture;
* InstLatx64, which uses AIDA64 to collect data on x86 CPUs.

LLVM's llvm-exegesis should be usable on all machines, especially to gather information on non-x86 ones.
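
For example (CPU name illustrative), passing -mtune=skylake to GCC schedules code for the Skylake microarchitecture without restricting the instruction set, while -march=skylake selects both the instruction set and the scheduling model; Clang 12 and later accept the same pair of flags on x86.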


See also

* Branch predication
* Code generation
* Instruction unit


Further reading

* (''Trace scheduling'')
* (''Percolation scheduling'')
* (''Global scheduling'')